-
Notifications
You must be signed in to change notification settings - Fork 2.5k
[HUDI-5341] CleanPlanner retains earliest commits must not be later than earliest pending commit #7568
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @stream2000 |
|
@danny0405, @zhuanshenbsj1, I have created this pull request to fix the problem mentioned in #7405 in which the implementation is a little complex. PTAL. |
...client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/common/util/ClusteringUtils.java
Outdated
Show resolved
Hide resolved
LGTM. Clean clustering instant sequentially,to keep no clean holes in timeline like archive seems much more easily to implement. |
hudi-common/src/main/java/org/apache/hudi/common/util/ClusteringUtils.java
Show resolved
Hide resolved
|
@leesf, could you please review this pull request? I have addressed above comments from @danny0405. |
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java
Show resolved
Hide resolved
|
@SteNicholas Sorry for the delay. I'll review the PR soon. |
|
@yihua, any comments for this pull request? |
yihua
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic to determine the earliest instant to archive wrt clustering seems to be okay but it is very complex now. It would be good to think about how to simplify this as a whole.
|
HUDI-5493 for revisiting the logic. |
|
@SteNicholas could you check the CI failure? |
hudi-common/src/main/java/org/apache/hudi/common/util/ClusteringUtils.java
Show resolved
Hide resolved
…han earliest pending commit
|
@yihua, I have already fixed the CI failure and updated the |
|
@hudi-bot run azure |
…han earliest pending commit (apache#7568)
…han earliest pending commit (apache#7568)
| if (config.getCleanerPolicy() == HoodieCleaningPolicy.KEEP_LATEST_COMMITS | ||
| && commitTimeline.countInstants() > commitsRetained) { | ||
| earliestCommitToRetain = commitTimeline.nthInstant(commitTimeline.countInstants() - commitsRetained); //15 instants total, 10 commits to retain, this gives 6th instant in the list | ||
| Option<HoodieInstant> earliestPendingCommits = hoodieTable.getMetaClient() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor. singular for "earliestPendingCommits"
| // Earliest commit to retain must not be later than the earliest pending commit | ||
| earliestCommitToRetain = | ||
| commitTimeline.nthInstant(commitTimeline.countInstants() - commitsRetained).map(nthInstant -> { | ||
| if (nthInstant.compareTo(earliestPendingCommits.get()) <= 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should use the timestamp compare apis in HoodieTimeline
HoodieTimeline.compareTimestamps(commit1Ts, GREATER_THAN, commit2Ts)
Change Logs
CleanPlannershould retain the earliest commits must not be later than the earliest pending commit. Meanwhile,HoodieTimelineArchivershould retain the clustering commit which instant can not be archived unless we ensure that the replaced files have been cleaned, without the replaced files metadata on the timeline, the fs view would expose duplicates for readers.Impact
CleanPlannerretains the earliest commits must not be later than the earliest pending commit andHoodieTimelineArchiverretains the clustering commit which instant can not be archived unless we ensure that the replaced files have been cleanedRisk level (write none, low medium or high below)
If medium or high, explain what verification was done to mitigate the risks.
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change
ticket number here and follow the instruction to make
changes to the website.
Contributor's checklist